fix(deps): subdir-agnostic bare cache fixes parallel sparse-checkout race (#1126)#1135
Conversation
…race (#1126) Refactor SharedCloneCache (WS2) to store BARE clones keyed only on (host, owner, repo, ref). Each consumer materializes its own working tree from the bare via 'git clone --local --shared --no-checkout' + 'git checkout HEAD'. Mirrors WS3's GitCache._create_checkout pattern. Pre-fix: two parallel subdir deps from the same upstream repo+ref raced through the cache; one consumer would materialize its subdir into the cached path, then the second consumer would find the cached path without its own subdir, raising RuntimeError("Subdirectory '...' not found in repository"). Post-fix: the cache is subdir-agnostic. Two consumers requesting different subdirs of the same repo+ref share one bare without racing. Per-consumer materialization runs against a unique mkdtemp path. Implementation: - _execute_transport_plan: Strategy template extracted from _clone_with_fallback; both _wt_action (existing GitPython path) and _bare_action (new subprocess-based 3-tier ladder) plug in. - _bare_clone_with_fallback: 3-tier fallback for SHA refs (init+fetch / shallow / full), 2-tier for symbolic refs. - _materialize_from_bare: per-consumer working tree with CRLF + LFS pinning; SHA resolved from --git-dir=<bare> rev-parse HEAD to avoid Windows file-handle leak (no Repo() on consumer dir). - _scrub_bare_remote_url: redacts remote.origin.url to redacted:// after every successful bare clone; logs at WARNING on failure (audit trail for token-leak-aware operators). - WS2 callsite rewritten in download_subdirectory_package; threads _ws2_resolved_commit past the legacy SHA-capture branch. - shared_clone_cache.py: docstring rewritten; APM_DEBUG-gated bare-shape invariant assert. Tests: - 21 unit tests in test_shared_clone_cache.py covering 18 of 19 design scenarios (race, fallback ladder, materialize lifetime, rmtree, invariant, error wording). - New parametrized E2E (test_install_subdir_dedup_e2e.py) with 3 ref variants (symbolic-https / sha-https / default-branch); added to scripts/test-integration.sh. - All 7573 unit tests pass; ruff check + ruff format --check pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
APM Review Panel:
|
| Persona | B | R | N | Takeaway |
|---|---|---|---|---|
| Python Architect | 0 | 2 | 2 | Template Method extraction of _execute_transport_plan is well-factored; bare-action tier ladder is large but correct; no blocking issues. |
| CLI Logging Expert | 0 | 0 | 2 | Logging surface is minimal and correct: stdlib logger for internal audit trail, no token leakage paths, ASCII-clean wording. |
| DevX UX Expert | 0 | 1 | 2 | Internal cache mechanics only; no CLI surface/flag/help/error-wording regressions. Typo-detection promise preserved with regression-trap test. |
| Supply Chain Security Expert | 0 | 2 | 1 | Solid token-hygiene improvement. _scrub_bare_remote_url eliminates the primary on-disk token persistence vector. Residual: FETCH_HEAD in tier-1. |
| OSS Growth Hacker | 0 | 1 | 1 | Bug-fix PR with strong story potential but missing CHANGELOG entry under [Unreleased]. |
| Auth Expert | 0 | 1 | 2 | Auth paths are sound: _execute_transport_plan composes correctly with both _wt_action and _bare_action; ADO bearer flows through; no AuthResolver bypass. |
| Doc Writer | 0 | 1 | 1 | PR fixes a user-observable bug but does not add a CHANGELOG entry. One Fixed line is required. No other doc surface affected. |
| Test Coverage Expert | 0 | 2 | 1 | Race-condition regression trap, error wording, Windows handle-leak guard, and lockfile within-cache parity all covered. ADO bearer + bare path test missing. |
B = blocking-severity findings, R = recommended, N = nits.
Counts are signal strength, not gates. The maintainer ships.
Top 5 follow-ups
- [Doc Writer + OSS Growth Hacker] Add CHANGELOG entry under
## [Unreleased]>Fixedfor [BUG] Parallel sparse-checkout race condition when two subdirectory deps reference the same repository #1126 -- Repo convention requires every merged PR with code changes to have a changelog entry. Two panelists flagged independently. Addressable in-PR before merge -- one line. - [DevX UX Expert] Reword
RuntimeError("Failed to materialize working tree from shared bare: ...")to user-facing language -- Internal jargon ("shared bare") leaks to users on failure path. Quick in-PR fix:"Failed to prepare dependency from cached clone: {e}". - [Supply Chain Security Expert] Truncate
FETCH_HEADinside_scrub_bare_remote_urlto eliminate residual tokenized URL -- Defense-in-depth:FETCH_HEADretains tokenized URL after tier-1 init+fetch. Not exploitable today (0700 tmpdir) but cheap to close. Follow-up issue or in-PR. - [Test Coverage Expert] Add unit test for ADO bearer 401 retry through
_bare_actioncomposition -- Missing test on a security-relevant path (bearer retry). Existing E2E covers_wt_actionbut not_bare_action. Post-merge follow-up issue. - [Test Coverage Expert] Harden lockfile cross-path parity assertion (replace soft
'unknown'guard with hard assert) -- Current E2E silently passes if SHA extraction returns'unknown'-- regression trap has a silent escape hatch. Post-merge hardening.
Architecture
classDiagram
direction LR
class SharedCloneCache {
<<Concurrency Primitive>>
+get_or_clone(host, owner, repo, ref, clone_fn) Path
+cleanup() None
-_entries dict
-_lock Lock
}
class _CacheEntry {
<<ValueObject>>
+lock Lock
+path Path or None
+error Exception or None
}
class GitHubPackageDownloader {
<<Facade>>
+download_subdirectory_package(...)
-_clone_with_fallback(...) Repo
-_execute_transport_plan(...) None
-_bare_clone_with_fallback(...) None
-_materialize_from_bare(...) str
}
class TransportSelector {
<<Strategy>>
+plan_for(url_form, prefs) TransportPlan
}
class TransportPlan {
<<ValueObject>>
+attempts list~TransportAttempt~
}
class GitCache {
<<Persistent Cache>>
+_create_checkout(shard_key, sha, url, env) Path
}
SharedCloneCache *-- _CacheEntry : per-key
GitHubPackageDownloader ..> SharedCloneCache : WS2 path uses
GitHubPackageDownloader ..> GitCache : WS3 path uses
GitHubPackageDownloader ..> TransportSelector : resolves plan
GitHubPackageDownloader ..> TransportPlan : iterates attempts
note for GitHubPackageDownloader "Template Method: _execute_transport_plan\naccepts clone_action callable (wt or bare)"
note for SharedCloneCache "Double-checked locking:\nglobal lock for entry creation,\nper-entry lock for clone-once"
class SharedCloneCache:::touched
class GitHubPackageDownloader:::touched
classDef touched fill:#fff3b0,stroke:#d47600
flowchart TD
A["download_subdirectory_package<br/>src/apm_cli/deps/github_downloader.py"] --> B{WS2 use_shared?}
B -->|Yes| C["SharedCloneCache.get_or_clone<br/>per-key entry lock"]
C --> D{Already cached?}
D -->|Yes| F["Return bare_path"]
D -->|No| E["_bare_clone_with_fallback"]
E --> G["_execute_transport_plan<br/>TransportSelector plan iteration"]
G --> H{"_bare_action: is_commit_sha?"}
H -->|Yes| I["Tier-1: git init --bare +<br/>fetch --depth=1 origin SHA"]
I -->|Fail| J["rmtree partial +<br/>Tier-2: git clone --bare full"]
H -->|No| K["Tier-1: git clone --bare<br/>--depth=1 --branch ref"]
K -->|Fail| L["rmtree partial +<br/>Tier-2: git clone --bare full"]
I -->|OK| M["git update-ref HEAD sha<br/>_scrub_bare_remote_url"]
J --> M
K -->|OK| N["_scrub_bare_remote_url"]
L --> N
M --> F
N --> F
F --> O["_materialize_from_bare"]
O --> P["git rev-parse HEAD from bare"]
P --> Q["git clone --local --shared --no-checkout"]
Q --> R["git config core.autocrlf=false<br/>+ LFS smudge disable"]
R --> S["git checkout HEAD"]
S --> T["Return resolved_sha"]
T --> U["Copy subdir to target_path<br/>robust_copytree"]
B -->|No| V["Legacy per-dep clone path"]
Recommendation
Ship after adding the CHANGELOG entry (one line, in-PR) and optionally rewording the RuntimeError message. Zero blocking findings across eight specialists, strong test coverage with passing regression trap at multiple tiers, and improved security posture. The FETCH_HEAD scrub and ADO bearer test gap are legitimate defense-in-depth items for a follow-up issue but do not represent current exploitable risk. The maintainer should merge once the CHANGELOG line lands.
Full per-persona findings
Python Architect
- [recommended]
_bare_actionclosure is ~100 lines of nested tier logic; extract to a named method for testability atsrc/apm_cli/deps/github_downloader.py:1020
The_bare_actionclosure inside_bare_clone_with_fallbackcapturesgit_exe,is_commit_sha, andreffrom the enclosing scope and contains two branching tier ladders. At ~100 lines it exceeds the inline-when-truly-unique heuristic. Extracting to a private method like_bare_tier_ladder(url, env, target, git_exe, ref, is_commit_sha)would make the tier logic independently unit-testable.
Suggested: Extract_bare_actionbody to a standalone method:def _execute_bare_tiers(self, url, env, target, git_exe, ref, is_commit_sha) -> None. - [recommended] Duplicated git subprocess patterns (init+fetch+update-ref+scrub) could use a small builder or helper at
src/apm_cli/deps/github_downloader.py:1061
The update-ref + scrub suffix is repeated 3 times. A small helper_finalize_bare(target, git_exe, env, sha=None)would DRY the call sites. Not blocking because the repetition is currently bounded. - [nit]
repo_holder: list[Repo] = []pattern in_clone_with_fallbackis an idiomatic Python closure workaround but anonlocalwould be clearer atsrc/apm_cli/deps/github_downloader.py:648
The list-as-mutable-cell pattern is functional but non-obvious. Usingnonlocal repo_resultinside the nested function is more explicit. Trivial change, zero behavioral impact. - [nit]
_materialize_from_bareduplicates the clone+config+checkout sequence fromGitCache._create_checkoutwithout sharing code atsrc/apm_cli/deps/github_downloader.py:1186
Both run the same 3-step pattern. A shared utility incore/would unify the pattern. However, the two live in different dependency layers and the duplication is bounded to ~30 lines, so this is a nit not a recommended -- extract only when a third consumer appears.
CLI Logging Expert
- [nit] Warning message says 'Tokenized URL may persist on disk until shared cache cleanup' -- consider saying 'until process exit' since cache is instance-scoped at
src/apm_cli/deps/github_downloader.py:136
'shared cache cleanup' is technically accurate but suggests an external mechanism the operator must invoke. 'until install completes' or 'until process exit' matches the actual lifecycle more precisely.
Suggested: Replace 'until shared cache cleanup' with 'until install run completes' in both warning messages (lines 136 and 144). - [nit] Second warning interpolates raw Exception repr (
%sonexc) -- considerexc_info=Truefor richer operator debug atsrc/apm_cli/deps/github_downloader.py:140
For a security-audit log, operators benefit from a full traceback, not just the repr.logging.warning(..., exc_info=True)adds the traceback at no cost when log level permits.
DevX UX Expert
- [recommended] New
RuntimeError('Failed to materialize working tree from shared bare: ...')leaks internal jargon to users atsrc/apm_cli/deps/github_downloader.py:617
The error message is written for maintainers, not users. A user hitting this sees 'bare', 'materialize', 'working tree' -- vocabulary from git internals, not APM's install mental model. Recovery guidance is also absent.
Suggested: Reword to:RuntimeError(f"Failed to prepare dependency from cached clone: {e}")or include a recovery hint. - [nit] WARNING log about token redaction failure may confuse users running with verbose logging at
src/apm_cli/deps/github_downloader.py:74
The warning sounds like a security alert the user must act on, but per the code this is best-effort and the cache is ephemeral. Consider downgrading to DEBUG or softening the language. - [nit] APM_DEBUG-only invariant error uses internal paths in the message at
src/apm_cli/deps/shared_clone_cache.py:706
Printing raw temp-dir paths is an anti-pattern. Since it is gated by APM_DEBUG this is acceptable for developer use, but worth noting for consistency.
Supply Chain Security Expert
- [recommended] FETCH_HEAD retains tokenized URL after tier-1 init+fetch (not scrubbed) at
src/apm_cli/deps/github_downloader.py:1051
Aftergit fetch --depth=1 origin <sha>, git writes FETCH_HEAD containing the tokenized URL._scrub_bare_remote_urlonly redactsremote.origin.urlin config; FETCH_HEAD is left untouched. The bare is 0700 and ephemeral, and_materialize_from_bareuses--local --sharedso the consumer's clone won't propagate FETCH_HEAD content. Risk is low but for defense-in-depth consider truncating/removing<bare>/FETCH_HEADinside_scrub_bare_remote_url. - [recommended] Best-effort scrub failure leaves token on disk until cache cleanup at
src/apm_cli/deps/github_downloader.py:127
_scrub_bare_remote_urlusescheck=Falseand catches all exceptions. Code is correctly logging the failure (audit trail). Consider whether a scrub failure on the primary path should escalate to deleting the bare and re-raising -- this would make the operation fail-closed rather than fail-open with a warning. - [nit] Tier-1 window: token in
.git/configbetween remote-add and scrub atsrc/apm_cli/deps/github_downloader.py:1044
Betweengit remote add origin <tokenized_url>and_scrub_bare_remote_url(target, ...), the token lives in the bare's config file on disk. Acceptable given the bare is mode-0700 in a per-session tmpdir, but worth documenting.
OSS Growth Hacker
- [recommended] Missing CHANGELOG entry for [BUG] Parallel sparse-checkout race condition when two subdirectory deps reference the same repository #1126 fix at
CHANGELOG.md:8
The[Unreleased]section is empty. Perchangelog.instructions.md, every merged PR that changes code must have an entry.
Suggested: Add a Fixed entry:**Parallel subdir install race fix:** two subdirectory deps from the same upstream repo+ref no longer intermittently fail with "Subdirectory ... not found" during parallel apm install. (#1126) - [nit] Module docstring is implementation-heavy; consider a one-liner adopters can grep at
src/apm_cli/deps/shared_clone_cache.py
The opening sentence could double as a grep-friendly summary for users debugging install failures.
Auth Expert
- [recommended] Tier-1 init+fetch stores tokenized URL in
.git/configbetween remote-add and scrub; onCalledProcessErrorfrom fetch the scrub never runs atsrc/apm_cli/deps/github_downloader.py:1069
Theshutil.rmtreeIS the correct mitigation but usesignore_errors=True, so if deletion partially fails (Windows file-locking edge case), the tokenized URL persists. Window is narrow.
Suggested: Add_scrub_bare_remote_url(target, git_exe, env)beforeshutil.rmtreein the except clause as defense-in-depth. - [nit]
_materialize_from_barereceives env that's only used for local-only operations where no auth is needed atsrc/apm_cli/deps/github_downloader.py:1186
git clone --local --sharednever contacts a remote, so the auth env vars are dead weight here. Consistent with codebase pattern. Fine. - [nit] Integration test uses direct
os.getenvforGITHUB_APM_PAT-- acceptable for test skip-logic but differs from production AuthResolver flow attests/integration/test_install_subdir_dedup_e2e.py:811
Standard pattern for integration tests. No issue.
Doc Writer
- [recommended] Missing CHANGELOG entry for user-observable bug fix at
CHANGELOG.md
PR fix(deps): subdir-agnostic bare cache fixes parallel sparse-checkout race (#1126) #1135 changes code and tests but does not touch CHANGELOG.md. The fix eliminates an intermittentRuntimeError("Subdirectory '...' not found in repository")duringapm install-- a user-visible failure mode.
Suggested: Under## [Unreleased], add a### Fixedblock:**Parallel subdir install race.** apm install no longer intermittently fails with RuntimeError: Subdirectory '<path>' not found in repository when multiple dependencies resolve to different subdirectories of the same repo@ref. (#1135, fixes #1126) - [nit] No user-facing doc surface affected -- confirmed at
docs/src/content/docs/**
Verified no docs/, README, MANIFESTO, agent/skill/instruction prose, or reference page describes this internal cache layout. No action.
Test Coverage Expert
- [recommended] ADO bearer 401 retry through
_bare_actionhas no unit test attests/unit/deps/test_shared_clone_cache.py
_execute_transport_plan's ADO bearer 401 retry now invokesclone_actionwith_bare_action. No test intests/unit/deps/exercises this specific composition. Existing integrationtest_ado_bearer_e2e.pyonly covers the working-tree path.
Suggested: Addtest_ado_bearer_retry_through_bare_actionmocking the first attempt to raiseCalledProcessErrorwith '401' in stderr.
Proof (missing):tests/unit/deps/test_shared_clone_cache.py::test_ado_bearer_retry_through_bare_action-- proves: ADO bearer 401 fallback composes correctly with_bare_action's rmtree-before-retry invariant [Secure by default, Multi-harness] - [recommended] Lockfile
resolved_commitparity: no cross-path comparison (bare-cache vs legacy) attests/integration/test_install_subdir_dedup_e2e.py
The E2E assertssha_a == sha_bwithin bare-cache path but no test compares against the legacy non-cache path for the same repo+ref. If a future refactor breaks SHA extraction, both sibling deps would get the same wrong SHA and the within-cache parity test would still pass.
Suggested: Add an integration test variant comparing bare-cache vs legacy path; or strengthen the existing E2E to fail-fast whensha_a == 'unknown'.
Proof (missing):tests/integration/test_install_subdir_dedup_e2e.py::test_resolved_commit_matches_legacy_path-- proves: Lockfile determinism across paths [Governed by policy, DevX] - [nit] E2E lockfile assertion uses soft guard that silently passes on
'unknown'attests/integration/test_install_subdir_dedup_e2e.py:139
if sha_a and sha_b and sha_a != 'unknown'means the parity assertion is entirely skipped if SHA extraction is broken. Defeats the stated test purpose.
Suggested: Replace soft guard withassert sha_a and sha_a != 'unknown', 'SHA extraction failed'before the equality assertion.
Proof (passed):tests/integration/test_install_subdir_dedup_e2e.py::test_two_subdirs_same_repo_parallel[symbolic-https]-- proves: Lockfile parity within bare-cache path (but soft guard means broken SHA extraction silently passes) [Governed by policy]
This panel is advisory. It does not block merge. Re-apply the
panel-review label after addressing feedback to re-run.
…ror wording) Two highest-signal in-PR follow-ups from apm-review-panel on PR #1135: 1. Add CHANGELOG.md '### Fixed' entry under [Unreleased] for #1126. Flagged independently by doc-writer and oss-growth-hacker; required by .github/instructions/changelog.instructions.md. 2. Reword RuntimeError raised by _materialize_from_bare from 'Failed to materialize working tree from shared bare: {e}' to 'Failed to prepare dependency from cached clone: {e}' to remove internal-jargon ('materialize', 'working tree', 'shared bare') from a user-facing error message (devx-ux-expert finding). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Reworks WS2 subdirectory dependency dedup to cache subdir-agnostic bare clones instead of mutable working trees, so parallel installs can materialize isolated consumer checkouts without racing each other. This fits the install pipeline by aligning the per-run cache with the existing WS3 bare-checkout pattern.
Changes:
- Replaced shared working-tree clone reuse with shared bare-clone reuse plus per-consumer materialization in
github_downloader.py/shared_clone_cache.py. - Added unit and integration regression coverage for the parallel subdirectory race and the new bare-clone fallback/materialization paths.
- Wired the new integration test into
scripts/test-integration.shand added an unreleased changelog entry.
Show a summary per file
| File | Description |
|---|---|
tests/unit/deps/test_shared_clone_cache.py |
Adds regression/unit coverage for bare-cache behavior, fallback tiers, and materialization semantics. |
tests/integration/test_install_subdir_dedup_e2e.py |
Adds live E2E coverage for parallel sibling subdirectory installs from one repo/ref. |
src/apm_cli/deps/shared_clone_cache.py |
Changes the per-run shared cache to store bare repos and adds a debug invariant check. |
src/apm_cli/deps/github_downloader.py |
Implements bare-clone fallback, per-consumer materialization, and WS2 resolved-commit threading. |
scripts/test-integration.sh |
Adds the new #1126 integration test to the CI integration script. |
CHANGELOG.md |
Adds an unreleased fix entry for the parallel subdirectory race. |
Copilot's findings
Comments suppressed due to low confidence (1)
src/apm_cli/deps/github_downloader.py:1027
- This retry cleanup switched back to raw
shutil.rmtree(..., ignore_errors=True)even though this module's_rmtree()helper exists specifically to absorb transient Windows git/AV locks. If a failed tier leaves a briefly locked file behind, the delete becomes a silent partial cleanup and the nextgit init --bare/git clone --bareattempt can fail on the leftover directory.
if target.exists():
shutil.rmtree(target, ignore_errors=True)
- Files reviewed: 6/6 changed files
- Comments generated: 5
| _ws2_resolved_commit = self._materialize_from_bare( | ||
| shared_bare_path, | ||
| temp_clone_path, | ||
| ref=ref, | ||
| env=self._git_env_dict(), | ||
| known_sha=ref if is_commit_sha else None, |
There was a problem hiding this comment.
Fixed in f5e2fc6. The known_sha=ref optimization is now gated on len(ref) == 40 (matches the same gate added to _bare_action's tier 1 in bare_cache.py). Short SHAs flow through tier 2 where git rev-parse --verify <ref>^{commit} resolves them to a canonical 40-char SHA before update-ref HEAD. Regression test added: tests/unit/deps/test_shared_clone_cache.py::TestBareCloneFallback::test_short_sha_skips_tier1_and_resolves_via_tier2.
| # always start from a clean slate. Behavior change from the | ||
| # pre-#1126 implementation - covered by 6.13. | ||
| if target.exists(): | ||
| shutil.rmtree(target, ignore_errors=True) |
There was a problem hiding this comment.
Fixed in f5e2fc6. Replaced 9 raw shutil.rmtree(..., ignore_errors=True) sites (4 in bare_cache.py, 5 in github_downloader.py -- including line 1027 / _wt_action per the low-confidence finding) with _rmtree(...) which delegates to apm_cli.utils.file_ops.robust_rmtree. That helper retries with backoff and chmod-resets read-only .git/objects/pack/*.pack files (the actual Windows lock case). import shutil removed from both modules. A duplicate _rmtree lives in bare_cache.py to avoid the github_downloader -> bare_cache circular import.
| # (host, owner, repo, ref). Each consumer materializes its | ||
| # own working tree from the bare; this is subdir-agnostic | ||
| # so two parallel consumers requesting different | ||
| # subdirectories of the same repo+ref can share one bare | ||
| # without racing on sparse-checkout. See design.md sec 5.5. |
There was a problem hiding this comment.
Updated in 1555757. The Parallel Downloads section now describes the WS2 bare + per-consumer materialize pipeline (shared bare clone per (host, owner, repo, ref), then git clone --local --shared --no-checkout into each consumer working tree) instead of the stale sparse-checkout copy. packages/apm-guide/.apm/skills/apm-usage/dependencies.md was checked and contains no sparse/bare/materialize mentions, so no resource update was needed.
| # Lockfile parity: same cache key -> same resolved_commit. | ||
| sha_a = getattr(pkg_a, "resolved_commit", None) or getattr(pkg_a, "commit_sha", None) | ||
| sha_b = getattr(pkg_b, "resolved_commit", None) or getattr(pkg_b, "commit_sha", None) | ||
| if sha_a and sha_b and sha_a != "unknown" and sha_b != "unknown": |
There was a problem hiding this comment.
Fixed in 1555757. The integration test now asserts pkg.resolved_reference.resolved_commit directly (the canonical attr from models/dependency/types.py and models/apm_package.py), with hard checks: each must be non-None, not "unknown", exactly 40 chars, and equal across both subdirs. The previous getattr(pkg, "resolved_commit", None) or getattr(pkg, "commit_sha", None) chain silently passed when both attrs were missing; this would have masked any cache-hit propagation regression.
| github_token = os.getenv("GITHUB_APM_PAT") or os.getenv("GITHUB_TOKEN") | ||
| if not github_token: | ||
| pytest.skip("GitHub token required (GITHUB_APM_PAT or GITHUB_TOKEN)") | ||
|
|
||
| if ref_kind == "sha-https": | ||
| ref_value = _resolve_known_sha() | ||
| if ref_value is None: | ||
| pytest.skip("Could not resolve a known SHA on github/awesome-copilot/main") |
There was a problem hiding this comment.
Fixed in 1555757. _resolve_known_sha() now reads GITHUB_TOKEN (or falls back to GITHUB_APM_PAT) and passes it to the gh api subprocess via env={**os.environ, "GH_TOKEN": token}. CI workers will have the PAT but no gh auth login state, so this prevents the sha-https variant from silently skipping in CI just because gh is unauthenticated.
|
[Windows portability paper audit] Paper audit of the diff for Windows-specific risks. PR-CI does not exercise
Summary verdict: PASS WITH FOLLOWUPS. One must-fix BUG (item 4: raw Recommended fix for item 4 (one-line change at each of 4 sites): # Replace at L658, L1027, L1071, L1112:
- shutil.rmtree(target, ignore_errors=True)
+ _rmtree(target)
Tracking issue recommended (not opened by this audit -- maintainer call) so item 10 doesn't get lost. Item 4 should land before merge given Windows CI runs only post-merge on (Audit performed against tip |
…ear file-length guardrail Splits github_downloader.py (was 2751 lines, exceeded 2400-line CI guardrail) by moving bare-cache helpers and the clone-failure error-message builder into a new src/apm_cli/deps/bare_cache.py module: - _scrub_bare_remote_url - bare_clone_with_fallback - materialize_from_bare - clone_with_fallback (parameterized via repo_cls for test patchability) - build_clone_failure_message GitHubPackageDownloader retains thin delegating instance methods (_clone_with_fallback, _bare_clone_with_fallback, _materialize_from_bare) so existing patch.object call sites continue to work unchanged. Existing tests that patch 'apm_cli.deps.github_downloader.Repo' continue to function because the thin wrappers resolve Repo from the github_downloader module namespace at call time and pass it as repo_cls. github_downloader.py: 2751 -> 2383 lines. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…n hardening (#1135) * Short-SHA correctness (Copilot review): bare-cache tier 1 (init+fetch by SHA) requires a full 40-char SHA — git's smart-HTTP cannot resolve abbreviations and update-ref refuses partial refs. Skip tier 1 for short SHAs and resolve them via tier-2 'rev-parse --verify <ref>^{commit}', feeding the canonical 40-char SHA into update-ref HEAD. Also gate the consumer-side 'known_sha=ref' optimization on len==40. * Windows portability (Copilot review): replace 9 raw shutil.rmtree call sites in bare_cache.py and github_downloader.py with the local _rmtree helper that delegates to robust_rmtree (chmod-retries read-only .git/objects/pack/* files). Adds a duplicate _rmtree to bare_cache.py to avoid the github_downloader -> bare_cache circular import. * FETCH_HEAD supply-chain scrub (panel follow-up): tier-1 init+fetch leaves the tokenized URL inside .git/FETCH_HEAD even after the remote.origin.url config redaction. Truncate FETCH_HEAD to empty in _scrub_bare_remote_url so the token does not survive on disk. * ADO bearer retry on bare 401 (panel follow-up): subprocess CalledProcessError str() does NOT include stderr, so the existing 401/Authentication-failed marker check in _execute_transport_plan never matched the bare-clone path. Append e.stderr to err_msg so bearer retry composes with bare clones identically to the GitPython-based working-tree path. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…update bare+materialize pipeline (#1135) * Bare-cache short-SHA test: assert tier 1 (init+fetch) is skipped for abbreviated SHAs and the resolved 40-char SHA from rev-parse drives update-ref HEAD, not the user-supplied abbreviation. * Bare-cache tier-2 test: stub rev-parse --verify stdout with the full 40-char SHA so the test exercises the new 'use resolved SHA, not input ref' code path. Existing tier-1 fixture now uses a real 40-char hex SHA (the prior fixture was 43 chars, which is now correctly rejected by the len()==40 gate). * FETCH_HEAD scrub tests: assert _scrub_bare_remote_url truncates FETCH_HEAD when present (defense-in-depth supply-chain) and is a no-op when absent (tier-2 full-clone path). * ADO bearer 401 retry on bare clone: end-to-end test that PAT failure with 'Authentication failed' in stderr triggers bearer retry through the bare clone action and the stale-PAT diagnostic fires on success. * Integration test (subdir dedup E2E): - lockfile parity: assert pkg.resolved_reference.resolved_commit is a 40-char SHA and identical across both subdirs (was a soft getattr fallback that silently passed if attr was missing). - sha-https variant: pass GH_TOKEN env to the gh api subprocess so it works under CI workers without ambient gh auth login state. * Docs: update Parallel Downloads section to describe the WS2 bare + per-consumer materialize pipeline (replaces stale sparse-checkout copy). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Folded in Copilot review + panel follow-upsCommits: f5e2fc6 (src), 1555757 (tests + docs) Copilot review findings (5/5 addressed)
Panel follow-ups
Verification
Each Copilot review thread received an inline reply pointing at the fix commit. Watching CI for green confirmation. |
Local verification matrixRe-ran the full suite locally on macOS / Python 3.12.9 against real Integration tests against real GitHub remote (
|
| Test | Ref kind | What it proves | Result |
|---|---|---|---|
test_two_subdirs_same_repo_parallel[symbolic-https-main] |
ref="main" |
The original race fix end-to-end: two consumers of the same repo@main resolve sibling subdirs without colliding |
PASS |
test_two_subdirs_same_repo_parallel[default-branch-None] |
no ref |
Default-branch path through _bare_clone_with_fallback (tier 1 shallow clone) |
PASS |
test_two_subdirs_same_repo_parallel[sha-https-RESOLVE_AT_RUNTIME] |
40-char commit SHA resolved at fixture-setup via gh api |
3-tier SHA fallback in bare_cache.py (init+fetch → init+fetch+rev-parse → full clone+rev-parse) and the bare-derived resolved-commit lockfile parity |
PASS |
3 passed in 17.79s.
Unit tests for the new module surface (tests/unit/deps/test_shared_clone_cache.py)
| Class | Coverage area | Tests | Result |
|---|---|---|---|
TestSharedCloneCache |
Cache fundamentals: single-clone, dedup, divergent refs, failure propagation, lock serialization, lifecycle | 6 | PASS |
TestDownloaderSharedCloneIntegration |
Two subdir deps share a single bare clone via the downloader | 1 | PASS |
TestBareCacheRaceCondition |
Barrier-deterministic parallel-different-subdirs reproducer (the bug) | 1 | PASS |
TestBareCloneFallback |
All 3 tiers: SHA tier 1 init+fetch, SHA tier 2 fetch-rejection, short-SHA tier 2, symbolic shallow tier 1 | 4 | PASS |
TestMaterializeFromBare |
git clone --local --shared --no-checkout semantics: real bare materialization, SHA from bare not consumer (lifetime invariant), known-SHA shortcut, LFS smudge disabled, autocrlf=false pinned before checkout |
5 | PASS |
TestSharedCloneCacheBareInvariant |
APM_DEBUG accepts bare / rejects non-bare clone shapes |
2 | PASS |
TestExecuteTransportPlanWtAction |
Working-tree action handles missing target | 1 | PASS |
TestBareCloneRetryRmtree |
_bare_action rmtrees target before init (Windows pack-file portability) |
1 | PASS |
TestInvalidSubdirErrorWording |
Typo subdir surfaces Subdirectory '...' not found (UX contract) |
1 | PASS |
TestBareScrubFetchHead |
Defense-in-depth: _scrub_bare_remote_url truncates FETCH_HEAD when present and is no-op when absent |
2 | PASS |
TestAdoBareBearerRetry |
ADO PAT 401 → bearer retry on the bare-clone path (this test caught a real prod bug: str(CalledProcessError) omits stderr, so the retry never fired before this commit) |
1 | PASS |
25 passed in 4.06s.
Regression sweep (no behavior changes outside the WS2 path)
Ran the broader install + dependency real-remote integration suites to confirm the bare+materialize pipeline does not regress unrelated install flows (lockfile, dry-run, verbose-redaction, link-mode, local-bundle, single/multi-dep install, branch refs, error handling, network failure, CLI commands, dep update workflow, cross-platform download).
| Suite | Tests | Result |
|---|---|---|
tests/integration/test_apm_dependencies.py |
15 | 13 pass, 1 skip (binary_compatibility_with_dependencies requires built binary), 1 unrelated env failure (test_authentication_token_handling asserts token starts with github_pat_; local gh auth token returns a gho_ OAuth token — pure env quirk, exists on main too) |
tests/integration/test_install_local_bundle_e2e.py |
19 | PASS |
tests/integration/test_install_dry_run_e2e.py |
5 | PASS |
tests/integration/test_install_verbose_redaction_e2e.py |
4 | PASS |
tests/integration/test_install_with_links.py |
2 | PASS |
30/30 install-flow regression tests pass; 13/13 real-remote dependency-install behavior tests pass.
Race-coverage stratification (the "simulate properly" gate)
Two layers, intentional:
| Layer | Mechanism | Why both | Status |
|---|---|---|---|
| Unit (deterministic) | threading.Barrier(2) in TestBareCacheRaceCondition forces lock contention in SharedCloneCache.get_or_clone regardless of timing |
Fast, hermetic, runs on every PR via CI | PASS |
| Integration (real BFS) | ThreadPoolExecutor(max_workers=4) BFS hits github/awesome-copilot with two sibling subdirs sharing one cache key |
Catches integration drift that mocked unit tests can't see; verifies the production code path against a real git remote | PASS |
Honest gaps (not run, declared)
- Windows execution:
build-release.yml(the only workflow withwindows-latest) does not trigger on PRs in this repo, only on push-to-main and tags. Windows coverage on this branch is via paper audit + the_rmtree/CRLF/LFS pinning fixes; first real Windows run will happen post-merge onmain. - ADO E2E: no live ADO remote in the integration suite; ADO bearer-retry path covered by unit test
test_bare_clone_recovers_via_ado_bearer_after_pat_401only.
Worktree: /Users/danielmeppiel/Repos/apm-1126-fix on 1555757b.
Roll forward the four PRs merged since v0.12.1: - #1137 feat(audit): default-on integration drift detection - #1135 fix(deps): subdir-agnostic bare cache (parallel sparse-checkout race) also resolves duplicate report #1140 (ADO sub-path manifestation) - #1129 docs: first-package guide accuracy (includes: auto, skill paths) - #1127 docs: APM's role for skills, plugins-as-packages, ADO sub-paths Bump pyproject.toml + uv.lock 0.12.1 -> 0.12.2 and convert the Unreleased CHANGELOG block into the 0.12.2 section, with a single 'so what' line per merged PR per the changelog contract. Co-authored-by: Daniel Meppiel <copilot-rework@github.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fixes #1126.
TL;DR
The shared-clone cache (WS2, introduced by PR #1116 as part of the install perf overhaul) was keyed on
(host, owner, repo, ref, subdir)and stored subdir-pinned working trees populated viagit checkout -- <subdir>. When two dependencies pointed at the same upstreamrepo@refbut different subdirectories, parallel installs raced through the cache: consumer A's checkout could be overwritten by consumer B's checkout before A finished copying its files, intermittently raising:This PR rebuilds the cache around the same paradigm WS3's
GitCache._create_checkoutalready uses successfully: the cache stores subdir-agnostic BARE clones, keyed only on(host, owner, repo, ref). Each consumer materializes its own private working tree from the bare viagit clone --local --shared --no-checkout+git checkout HEAD. Sibling consumers no longer share mutable working-tree state, so the race goes away by construction.Problem (WHY)
PR #1116 introduced
SharedCloneCacheto dedup git clones across the dependency graph duringapm install. The original design cached the post-checkout working tree keyed on(host, owner, repo, ref, subdir). The race manifested when two deps shared(host, owner, repo, ref)but specified different subdirs:(host, owner, repo, ref, "skills/X"), populates its checkout, releases the lock, and starts copying files out.git checkout -- skills/Y, mutating the working tree A is still reading from.os.path.exists(repo_dir / "skills/X")check now fails because B's checkout truncated/overwrote files A expected.The per-key lock on
SharedCloneCachewas correct for cache population, but did not protect against post-population mutation when two cache keys map to the same on-disk repository. The fix is structural: stop sharing post-checkout state at all.Approach (WHAT)
Mirror the WS3
GitCachepattern that already proves out in production:SharedCloneCachekeys on(host, owner, repo, ref)only. No subdir component. Two deps that share repo+ref get the same cache entry.git clone --local --shared --no-checkout <bare> <consumer>+git checkout HEAD. The--local --sharedmode reuses the bare's object store (no disk-space cost) but gives each consumer an isolated working tree and HEAD.This is the exact pattern in
src/apm_cli/cache/git_cache.py::_create_checkout. We are not inventing a new approach; we are unifying WS2 with WS3.Implementation (HOW)
Components touched
flowchart TD A["download_subdirectory_package<br/>github_downloader.py"] --> B{WS2 use_shared?} B -->|Yes| C["SharedCloneCache.get_or_clone<br/>per-key entry lock"] C --> D{Already cached?} D -->|Yes| F["Return bare_path"] D -->|No| E["_bare_clone_with_fallback"] E --> G["_execute_transport_plan<br/>Strategy template"] G --> H{"_bare_action: is_commit_sha?"} H -->|Yes| I["Tier-1: git init --bare +<br/>fetch --depth=1 origin SHA"] I -->|Fail| J["rmtree partial +<br/>Tier-2: git clone --bare full"] H -->|No| K["Tier-1: git clone --bare<br/>--depth=1 --branch ref"] K -->|Fail| L["rmtree partial +<br/>Tier-2: git clone --bare full"] I -->|OK| M["git update-ref HEAD sha<br/>_scrub_bare_remote_url"] J --> M K -->|OK| N["_scrub_bare_remote_url"] L --> N M --> F N --> F F --> O["_materialize_from_bare"] O --> P["git rev-parse HEAD from bare"] P --> Q["git clone --local --shared<br/>--no-checkout"] Q --> R["git config core.autocrlf=false<br/>+ LFS smudge disable"] R --> S["git checkout HEAD"] S --> T["Return resolved_sha"] T --> U["Copy subdir to target_path"] B -->|No| V["Legacy per-dep clone path"]Key functions
SharedCloneCache(shared_clone_cache.py): cache key narrowed to(host, owner, repo, ref). Bare temp-dir name changed tobare. OptionalAPM_DEBUG=1invariant assert that the cached path is a bare repo._bare_clone_with_fallback(github_downloader.py): new entry point. Builds a transport plan, then plugs_bare_action(subprocess-based) into the shared_execute_transport_plantemplate (which the existing working-tree_clone_with_fallbackalso uses via_wt_action). Two-tier ladder per attempt: tier-1 attempts shallow (init+fetch for SHA refs,--depth=1 --branchfor symbolic refs); onCalledProcessErrorrmtrees partial state and falls back to tier-2 full bare clone. After every successful clone,_scrub_bare_remote_urlredactsremote.origin.urltoredacted://so any tokenized URL is removed from the bare's.git/config._materialize_from_bare(github_downloader.py): runsgit rev-parse HEADagainst the bare (--git-dir=<bare>) to obtain the resolved SHA without ever opening aRepo()on the consumer dir (avoids Windows handle leaks that block downstream rmtree). Thengit clone --local --shared --no-checkoutfollowed bygit checkout HEAD._execute_transport_plan(refactored): the existing transport-plan iteration logic was extracted into a Strategy template that accepts aclone_actioncallable._wt_action(working tree, GitPython-based) and_bare_action(bare, subprocess-based) plug in. ADO bearer 401 retry composes correctly with both strategies. Catches(GitCommandError, subprocess.CalledProcessError)because the two strategies raise different exception types._ws2_resolved_committhread-through: the resolved SHA from_materialize_from_bareis threaded past the legacyRepo(temp_clone_path).head.commit.hexshablock, so the lockfile records the same commit the bare-cache path produced.Trade-offs
(repo, ref, subdir)got its own working tree. Net win on disk (objects deduplicated via--local --shared).git clone --local --shared+git checkout HEAD(single-process, local FS). On modern hardware this is sub-100ms even for large repos. Acceptable cost for correctness._wt_action+_bare_action) coexist temporarily. The legacy non-cache path still uses_wt_action; the WS2 path uses_bare_action. Both compose with the shared_execute_transport_plan. Future cleanup could converge on bare-only, but is out of scope here._bare_actionclosure is ~100 lines (panel feedback). Logic is correct; extracting to a named method for testability is a recommended follow-up.Validation evidence
uv run --extra dev ruff check src/ tests/anduv run --extra dev ruff format --check src/ tests/both silent. Verified locally before push.uv run pytest tests/unit/deps/test_shared_clone_cache.py -q-> 21 passed in 2.17s. Full unit run -> 7573 passed.tests/unit/deps/test_shared_clone_cache.py):TestBareCacheRaceCondition::test_parallel_different_subdirs_both_succeed-- regression trap for the original race; two parallel materializes against the same bare both succeed.TestBareCloneFallback-- tier-1 -> tier-2 fallback for both SHA and symbolic refs; coverssubprocess.CalledProcessErrortriggering rmtree-then-retry.TestMaterializeFromBare::test_consumer_resolved_sha_obtained_from_bare_not_consumer-- regression trap that SHA is read from--git-dir=<bare>, notRepo(consumer_dir). Prevents Windows handle-leak regressions.TestSharedCloneCacheBareInvariant--APM_DEBUG=1invariant: cached path must be a bare repo.TestExecuteTransportPlanWtAction+TestBareCloneRetryRmtree-- rmtree-before-retry semantics on both_wt_actionand_bare_action.TestInvalidSubdirErrorWording::test_typo_subdir_raises_subdirectory_not_found-- the user-facing typo-detection error message ("Subdirectory '...' not found in repository") is preserved through the bare-cache path.tests/integration/test_install_subdir_dedup_e2e.py): 3-variant parametrized test (symbolic-https,default-branch,sha-https) that drives paralleldownload_subdirectory_packageagainstgithub/awesome-copilotwith two sibling subdirs, asserting both succeed and produce the sameresolved_commit. Wired intoscripts/test-integration.sh::run_e2e_tests. Skips whenGITHUB_APM_PAT/GITHUB_TOKENis absent. SHA for thesha-httpsvariant is resolved at runtime viagh api(no fabricated constants).How to test
Panel review
Apm-review-panel was run; full advisory comment posted above. Stance:
ship_with_followups. Zero blocking findings across eight specialist personas. Two highest-signal in-PR follow-ups already addressed in commitd5f51012:## [Unreleased]>### Fixed(flagged by doc-writer + oss-growth-hacker independently).RuntimeErroruser-facing wording ("Failed to materialize working tree from shared bare"->"Failed to prepare dependency from cached clone") per devx-ux-expert finding.Remaining recommended follow-ups are post-merge improvements: FETCH_HEAD truncation in
_scrub_bare_remote_url(defense-in-depth, supply-chain-security); ADO bearer 401 retry through_bare_actionunit test (test-coverage); cross-path lockfile parity test (test-coverage);_bare_actionclosure extraction (python-architect).